Implement Spark decimal multiply and divide #4613
Conversation
"min(38, a_precision - a_scale + b_scale + max(6, a_scale + b_precision + 1))") | ||
.integerVariable( | ||
"r_scale", | ||
"min(37, max(6, a_scale + b_precision + 1))") // if precision is |
Usually we derive the result type from the function signature constraint, but Spark decimal arithmetic is very complex, and I'm not sure the computation can be represented as a string. If it can, can anyone guide me?
Or could we pass a function to the signature and have it compute the result type?
Right now I compute the result type a second time, which is not a good solution and causes the function that invokes this one, such as makeCheckOverflow, to receive a wrong input argument type.
In Spark, the decimal arithmetic takes the form check_overflow(a * b), and we use the argument type in the function's apply, so the type is correct again after this function.
Do you have any suggestions here? @mbasmanova
You could just review the function signature; the other parts will be refactored after #4434 merges.
This is the Spark decimal result type computation: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L604
I don't know how to express the if in a string constraint.
This is the adjustPrecisionScale
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L171
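For reference, a C++ sketch of that logic (the linked Scala code is authoritative; the names below are illustrative and the negative-scale branch is omitted):

#include <algorithm>
#include <utility>

// Sketch of Spark's adjustPrecisionScale: when the ideal precision exceeds 38,
// reduce the scale, but never below min(scale, 6), so integral digits are
// preserved first.
std::pair<int, int> adjustPrecisionScale(int precision, int scale) {
  constexpr int kMaxPrecision = 38;
  constexpr int kMinimumAdjustedScale = 6;
  if (precision <= kMaxPrecision) {
    return {precision, scale};
  }
  const int intDigits = precision - scale;
  const int minScale = std::min(scale, kMinimumAdjustedScale);
  const int adjustedScale = std::max(kMaxPrecision - intDigits, minScale);
  return {kMaxPrecision, adjustedScale};
}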
@jinchengchenghh Is allowPrecisionLoss a session property? I imagine we would need to define 2 separate functions, one for allowPrecisionLoss=true and another for allowPrecisionLoss=false, and choose the right kind based on the value of allowPrecisionLoss.
// When `spark.sql.decimalOperations.allowPrecisionLoss` is set to true, if the precision / scale
// needed are out of the range of available values, the scale is reduced up to 6, in order to
// prevent the truncation of the integer part of the decimals.
protected def allowPrecisionLoss: Boolean = SQLConf.get.decimalOperationsAllowPrecisionLoss
Yes, it defaults to true; the current implementation is based on the default value.
val DECIMAL_OPERATIONS_ALLOW_PREC_LOSS =
buildConf("spark.sql.decimalOperations.allowPrecisionLoss")
.internal()
.doc("When true (default), establishing the result type of an arithmetic operation " +
"happens according to Hive behavior and SQL ANSI 2011 specification, i.e. rounding the " +
"decimal part of the result if an exact representation is not possible. Otherwise, NULL " +
"is returned in those cases, as previously.")
.version("2.3.1")
.booleanConf
.createWithDefault(true)
allowPrecisionLoss is a rarely used config; someone interested in it can add the new function for allowPrecisionLoss=false, perhaps in another PR.
Do you have suggestions for generating the result type via the adjustPrecisionScale function? I don't have much context on the string constraint that is then computed by expression::calculation::evaluate. @mbasmanova
@jinchengchenghh I'm not sure I understand this question. It might be better to create a GitHub issue and move this discussion over there. Please CC @majetideepak and @karteekmurthys.
@jinchengchenghh there is an overhaul of the Velox Type implementation.
@majetideepak OK, I will refactor after it merges.
Looks like this Boost version cannot include it. My build error is:
My build command succeeds in my local environment, but raises the following exception in CI. Can you help guide me on how to fix this issue? @mbasmanova
CC: @majetideepak @kgpai Deepak, Krishna, any chance you can help here?
@mbasmanova Taking a look.
@jinchengchenghh Can you give more details on your local env? Are you on Linux using gcc?
Feel free to reach out to me on Slack (@kpai) if you have any questions or need assistance.
Thank you very much. I found that in my local environment BoostSource is AUTO, but in CI it is BUNDLED. I updated the Boost build script and this problem has been fixed. @mbasmanova @kgpai
Can you help review? Thanks! @kgpai
velox/common/base/BitUtil.h
Outdated
uint64_t lo = static_cast<uint64_t>(word);
return (hi == 0) ? 64 + __builtin_clzll(lo) : __builtin_clzll(hi);
} else {
return -1;
throw VELOX_UNSUPPORTED instead of returning -1?
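A minimal sketch of what that suggestion could look like (it assumes the VELOX_UNSUPPORTED macro from velox/common/base/Exceptions.h; not the final code):

#include <cstdint>
#include <type_traits>
#include "velox/common/base/Exceptions.h"

// Sketch only: count leading zeros for 64-bit and 128-bit words, throwing for
// any other type instead of returning -1.
template <typename T = uint64_t>
inline int32_t countLeadingZeros(T word) {
  if constexpr (std::is_same_v<T, uint64_t>) {
    return __builtin_clzll(word);
  } else if constexpr (std::is_same_v<T, __uint128_t>) {
    uint64_t hi = static_cast<uint64_t>(word >> 64);
    uint64_t lo = static_cast<uint64_t>(word);
    return (hi == 0) ? 64 + __builtin_clzll(lo) : __builtin_clzll(hi);
  } else {
    VELOX_UNSUPPORTED("countLeadingZeros only supports 64-bit and 128-bit words");
  }
}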
@jinchengchenghh Can we converge the existing DecimalUtil functionality with this? Looks like this is an improvement and could benefit Presto as well.
sparksql::DecimalUtil::divideWithRoundUp can divide without overflow in more cases; I'm not sure whether this behavior complies with Presto. Presto may expect overflow in such cases? @majetideepak
Can you help review? Thanks! @mbasmanova
Presto Java uses the BigInteger class for decimals. I believe it does not overflow easily either, similar to Spark.
That is true. It uses BigInteger support and doesn't overflow that easily. Our decimal arithmetic operations do verify whether the results overflow decimal limits and throw exceptions if they do. You can refer to the unit tests and e2e tests for that.
return makeLongDecimalFlatVector(int128s, DECIMAL(precision, scale));
}

int128_t convertStringToInt128(const std::string& value, bool& nullOutput) {
Use int128_t HugeInt::parse(const std::string& str)
bool nullOutput;
testDecimalExpr<TypeKind::HUGEINT>(
    makeLongDecimalFlatVector(
        {convertStringToInt128("201" + std::string(31, '0'), nullOutput)},
Or you could also replace it with int128_t build(uint64_t hi, uint64_t lo) from Hugeint.h.
I understand that the string form is readable but slower.
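Hypothetical usage of the two helpers suggested above (signatures as quoted in this review; check Hugeint.h for the exact API):

// Build the same kind of value either from a string or from hi/lo 64-bit halves.
int128_t fromString = HugeInt::parse("201" + std::string(31, '0'));
int128_t fromParts = HugeInt::build(0 /*hi*/, 123456789 /*lo*/);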
In tests, we should favor readability as top performance is not important.
int256_t inAbs = abs(in);
bool isNegative = in < 0;

uint128_t unsignResult = (inAbs & uint128Mask).convert_to<uint128_t>();
If possible, these two can be merged.
No, T is int64_t while the variable type uses uint64_t; we cannot get the template UT as uint64_t.
Does it require templating? You are using specific types in the functions. You could simply overload the functions.
Or maybe you can still merge convert<int64_t> and convert<int128_t>. Do
typedef typename std::conditional<std::is_same_v<TInput, int64_t>, uint64_t, __uint128_t>::type UT;
You can switch the unsigned type depending on T.
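A minimal sketch of that merged helper (hypothetical code following the pattern quoted above, not the actual PR implementation):

#include <cstdint>
#include <type_traits>
#include <boost/multiprecision/cpp_int.hpp>

using int256_t = boost::multiprecision::int256_t;

// Select the unsigned intermediate type from the signed output type so the
// int64_t and int128_t conversions can share one template. Overflow checking
// is omitted in this sketch.
template <typename TOutput>
TOutput convertFromInt256(const int256_t& in) {
  using UT = typename std::
      conditional<std::is_same_v<TOutput, int64_t>, uint64_t, __uint128_t>::type;
  const bool isNegative = in < 0;
  const int256_t inAbs = isNegative ? int256_t(-in) : in;
  // Keep only the bits that fit the unsigned intermediate type.
  const int256_t mask = (int256_t(1) << (sizeof(UT) * 8)) - 1;
  const UT unsignedResult = (inAbs & mask).convert_to<UT>();
  return isNegative ? -static_cast<TOutput>(unsignedResult)
                    : static_cast<TOutput>(unsignedResult);
}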
Thanks for your guidance, fixed.
Do you have further comments? @karteekmurthys
@jinchengchenghh Thank you for the contribution.
Use Arrow gandiva BasicDecimal128 algorism to compute value, it will introduce int256_t to get much bigger result scale than presto.
For some corner cases, it is an approximate value to Spark.
This description is hard to understand as it is lacking details. What is the "Arrow gandiva BasicDecimal128 algorism"? Would you include a link and/or summary of the algorithm?
it will introduce int256_t to get much bigger result scale than presto.
What is "much bigger result scale"? Would you clarify?
For some corner cases, it is an approximate value to Spark.
What are these corner cases? Would you provide examples and a description of these cases?
@@ -22,6 +22,14 @@ var ([[:alpha:]][[:alnum:]_]*)
"=" return Parser::token::ASSIGN;
"min" return Parser::token::MIN;
"max" return Parser::token::MAX;
"<" return Parser::token::LT;
What are these changes for? It would be nice to extract these into a separate PR and add a test.
This change enhances the function signature support, so @majetideepak suggested I add it in this PR: #4773 (comment).
Are you sure we should extract these into a separate PR?
Hmm... Can you describe these changes in the PR description? Are there tests for these? Can you give me a pointer?
https://github.com/facebookincubator/velox/pull/4613/files#diff-4baee9f82f9c347f753972415d345941d2c2a110b608d480b090650171da096bR379
It is used to compute the result scale in function getResultScale
std::string getResultScale(std::string precision, std::string scale) {
return fmt::format(
"({}) <= 38 ? ({}) : max(({}) - ({}) + 38, min(({}), 6))",
precision,
scale,
scale,
precision,
scale);
}
The current decimal multiply and divide tests can cover the <= and ? changes. I also extended the calculation to support basic operators like <.
If the computation is wrong, we cannot get the correct result type.
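For illustration, a hypothetical call (the precision/scale expressions below are examples, not necessarily the exact ones used in the PR):

// Unadjusted divide precision/scale expressions fed into getResultScale.
std::string precision = "a_precision - a_scale + b_scale + max(6, a_scale + b_precision + 1)";
std::string scale = "max(6, a_scale + b_precision + 1)";
// Produces a constraint string of the form
//   "(p) <= 38 ? (s) : max((s) - (p) + 38, min((s), 6))"
// with p and s substituted, which the signature machinery then evaluates.
std::string constraint = getResultScale(precision, scale);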
Got it. Let's extract these changes into a separate PR and add a test to velox/expression/tests/SignatureBinderTest.cpp to cover all of the added functionality.
@@ -52,7 +52,9 @@ set(BOOST_HEADER_ONLY
crc
circular_buffer
math
mpl
Are you adding new dependencies to Velox? Please, update PR description to clearly state that and explain why these are needed.
Here I include this header file, so we will need it: https://github.com/facebookincubator/velox/pull/4613/files#diff-4b21d26191cbbdb9bc90cafd83aed144a4483b1c594f224cf3bbe67f401f5f83R19
#include <boost/multiprecision/cpp_int.hpp>
It introduces int256_t.
velox/common/base/BitUtil.h
Outdated
inline int32_t countLeadingZeros(uint64_t word) {
return __builtin_clzll(word);
template <typename T = uint64_t>
inline int32_t countLeadingZeros(T word) {
It would be nice to extract this change into a separate PR. This would make this PR smaller and easier to review.
velox/docs/functions/spark/math.rst
Outdated
@@ -57,6 +57,12 @@ Mathematical Functions
SELECT 2L / 2L; -- 1.0
SELECT 3 / 0; -- NULL

.. spark:function:: divide(x, y) -> decimal

Returns the results of dividing x by y. The types of x and y must be decimal.
Would you update the docs for existing "divide(x, y) -> double" to clarify what types of x and y it supports?
Can x and y be decimals with different precision / scale? If so, can you make this clear? What would be the precision / scale of the result? Please, clarify.
using uint128_t = __uint128_t;
using int256_t = boost::multiprecision::int256_t;

class DecimalUtil {
Please, document this class and its public methods.
// bits_required(x * 10^y) <= bits_required(x) + floor(log2(10^y)) + 1
// We precompute floor(log2(10^x)) + 1 for x = 0, 1, 2...75, 76

static const int32_t floorLog2PlusOne[] = {
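Illustrative only (not part of the PR): each table entry is floor(log2(10^x)) + 1 = floor(x * log2(10)) + 1, so the table can be regenerated with a few lines:

#include <cmath>
#include <cstdio>

int main() {
  // Print floor(log2(10^x)) + 1 for x = 0 .. 76.
  for (int x = 0; x <= 76; ++x) {
    std::printf("%d, ", static_cast<int>(std::floor(x * std::log2(10.0))) + 1);
  }
  return 0;
}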
Please, check coding style document and update the PR to comply. Here, the naming needs to be updated.
Do you have further comments? @mbasmanova
@Yuhta Jimmy, would you help review and merge this PR?
Do you have further comments? Thanks! @Yuhta
@Yuhta Gentle ping, this patch is the base of the decimal-related functions for Spark SQL. Thanks, -yuan
@jinchengchenghh Looks good overall. Can you rebase and run the new functions through the Spark function fuzzer to make sure there are no corner-case bugs?
Rebased and passed the Spark fuzzer test. Can you help merge it? Thanks! @Yuhta
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Use the Arrow Gandiva BasicDecimal128 algorithm to compute values. Introduce an intermediate int256_t value to avoid computation overflow.
Add a dependency on Boost multiprecision to get access to #include <boost/multiprecision/cpp_int.hpp>, which defines int256_t.
Arrow implementation:
Resolves: #4773
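Below is a minimal sketch of the int256_t-intermediate idea described above, assuming Boost multiprecision and illustrative helper names; the actual code in velox/functions/sparksql differs in detail:

#include <cstdint>
#include <boost/multiprecision/cpp_int.hpp>

using int128_t = __int128_t;
using int256_t = boost::multiprecision::int256_t;

// Multiply two unscaled decimal values without intermediate overflow by
// widening to int256_t, then rescale back down by 10^scaleReduction with
// round-half-up. Returns false if the result no longer fits 38 digits.
bool multiplyUnscaled(int128_t a, int128_t b, int32_t scaleReduction, int128_t& out) {
  const int256_t wide = int256_t(a) * int256_t(b);
  int256_t divisor = 1;
  for (int32_t i = 0; i < scaleReduction; ++i) {
    divisor *= 10;
  }
  const int256_t rounded =
      (wide + (wide < 0 ? -divisor / 2 : divisor / 2)) / divisor;
  static const int256_t kMaxUnscaled =
      boost::multiprecision::pow(int256_t(10), 38) - 1; // max 38-digit value
  if (rounded > kMaxUnscaled || rounded < -kMaxUnscaled) {
    return false; // overflow; the caller decides whether to throw or return null
  }
  out = rounded.convert_to<int128_t>();
  return true;
}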